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Abstract 

Background: Selecting genes and pathways indicative of disease is a central problem in computational biology. 
This problem is especially challenging when parsing multi-dimensional genomic data. A number of tools, such as 
/_ r norm based regularization and its extensions elastic net and fused lasso, have been introduced to deal with this 
challenge. However, these approaches tend to ignore the vast amount of a priori biological network information 
curated in the literature. 

Results: We propose the use of graph Laplacian regularized logistic regression to integrate biological networks 
into disease classification and pathway association problems. Simulation studies demonstrate that the performance 
of the proposed algorithm is superior to elastic net and lasso analyses. Utility of this algorithm is also validated by 
its ability to reliably differentiate breast cancer subtypes using a large breast cancer dataset recently generated by 
the Cancer Genome Atlas (TCGA) consortium. Many of the protein-protein interaction modules identified by our 
approach are further supported by evidence published in the literature. Source code of the proposed algorithm is 
freely available at http://www.github.com/zhandong/Logit-Lapnet. 

Conclusion: Logistic regression with graph Laplacian regularization is an effective algorithm for identifying key 
pathways and modules associated with disease subtypes. With the rapid expansion of our knowledge of biological 
regulatory networks, this approach will become more accurate and increasingly useful for mining transcriptomic, 
epi-genomic, and other types of genome wide association studies. 



Introduction 

Technologies for high throughput genetic profiling have 
revolutionized the study of human development and dis- 
ease. Expression profiles spanning the entire human 
genome not only allow investigators to better under- 
stand disease subtypes [1], but also define new cate- 
gories associated with sensitivity to pharmacologic 
treatment [2,3] and other clinical outcomes [4,5]. A cen- 
tral problem in these genomic studies is to construct an 
accurate predictive model and delineate specific genes 
or pathways driving a phenotype. Logistic regression is 
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widely used for classification [6-9]. The number of 
genes, however, in high-throughput studies are often 
much larger than the number of specimens in a given 
study. This limitation causes instability in the algorithms 
used to select driver genes and poor performance of 
predictive models. The lasso algorithm for logistic 
regression was introduced to address these problems 
and perform feature selection through Lx-norm regulari- 
zation [10]. However, the number of variables selected 
by the lasso is bounded by the number of observations 
in an experiment. Furthermore, correlated variables are 
rarely selected as part of the predictive model at the 
same time. Various extensions of the Lasso algorithm, 
such as elastic net, pelora, grouped lasso and fused lasso 
introduce grouping or smoothness regularization terms 
to address these limitations [11-14]. Of these, the fused 
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lasso and elastic net have been successfully applied to a 
largest number of gene expression and genome wide asso- 
ciation studies [15-18]. Both fused Lasso and elastic net 
allow correlated genes or neighboring genes on a chromo- 
some to be selected into a predictive model together. 
However, these algorithms tend to ignore functional inter- 
actions between individual gene products documented in 
the scientific literature. Integration of network information 
in the gene marker identification has shown to outperform 
methods without network information by Chuang et al 
[19]. Specifically, their networked-based scoring and 
greedy search algorithm identified more robust gene mar- 
kers with better prediction accuracy on metastasis status 
of breast cancer patients in two cohorts. In recent years, 
vast amounts of data detailing biologic networks has been 
organized into searchable databases. For example, 
BioGRID documents protein and genetic interactions 
from more than 39,991 publications [20]. The KEGG 
pathway database similarly curates molecular interactions 
and relational networks representing systemic functions at 
the level of both the cell and organism [21]. To incorpo- 
rate biological network information into regression mod- 
els, a network-constrained regularization algorithm has 
been previously proposed for use with linear regression 
[22]. Use of this network-constrained algorithm has been 
shown to out-perform both lasso and elastic net analyses 
executed independently of biologic input. 

Classification algorithms integrating network structure 
information have been proposed in other settings. For 
example, a network constrained support vector machine 
has been proposed to analyze functional magnetic reso- 
nance imaging data [23] and cancer microarray data 
[24-26]. Network based prior has also been demon- 
strated to improve variable selection accuracy under the 
Bayesian inference framework [27,28]. 

Here, we propose a graphical Laplacian network regu- 
larized logistic regression method following the frame- 
work established by Li et al. [22]. We hypothesize that 
the integration of biological networks, such as Protein- 
Protein interactions, will improve prediction accuracy 
and variable selection in logistic models. To validate use 
of the proposed algorithm, we studied its theoretical 
properties and compared its performance to Lj-norm 
regularized logistic regression and elastic net logistic 
regression on both simulated and real biologic data. 
Lastly, we demonstrate the utility of the proposed algo- 
rithm by using it to differentiate breast cancer subtypes 
and delineate biologic network modules associated with 
the triple negative breast cancer (TNBC) subtypes. 

Materials and methods 

Graph laplacian regularized logistic regression model 

Suppose that the data set has n observations with p genes. 
Let y = (yi, y n ) T be the response with y, e {0, 1} and 



X = [xi|, |x p ] be the matrix of biomarkers measured on 
n samples with x ( = (xy, x n j) T for / = 1, p genes. 
Without loss of generality, we can assume that each gene 
is standardized. The binary clinical variable, y, can be pre- 
dicted using the logistic model: 

Prob (y=l|X;l)= 1 (1) 
1 + e Kp 

The parameter j3 can be estimated by maximizing the 
log likelihood function of the logistic model. However, it 
is well known that this estimation procedure performs 
poorly for both prediction purposes and variable selec- 
tion when p » n. Although various sparse parameter 
estimation procedures have been introduced to address 
these problems, these approaches tend to generate dis- 
connected biomarkers that are rarely interpretable. 

To incorporate biological network information into the 
model estimation procedure, we adopted the network- 
constrained regularization framework proposed by Li et 
al [22]. Given a biological network G = (V, E), where V is 
the set of genes that correspond to p predictors, E is the 
adjacency matrix, and e uv = 1 if there exists an edge 
between u and v, otherwise e uv = 0. The normalized 
graph Laplacian matrix L for G can be defined by 

L = I-D~5£D~2 ( 2 ) 

where D = diag(£-l p ) is a degree matrix with the diag- 
onal elements equal to the degrees for each node in G. 
For any fixed non-negative regularization parameters A 
and a, we define the logistic graph Laplacian net (Logit- 
Lapnet) criteria as: 

n 

L(X,a, / 5) = ^[-y,X, / 3 + ln(l +e x '' J )] + + \(1 - a )(/3, P) L (3) 

i-i 

Where 

P 

\Ph=J2\Pj\> 

d u , d v are the degrees of nodes u and v respectively. 
In (3), Xi is the ith row of the matrix X. The first term 
in equation (3) is the negative log likelihood function 
of the logistic model. The second term is an i^-norm 
penalization on [}, which encourages sparsity on the 
coefficients. The last term is a generalized L 2 -norm 
penalty using the graph Laplacian matrix, which 
encourages smoothness on coefficients of genes that 
are connected in the biological network. If we set a = 
1, the Logit-Lapnet criteria is equal to the simple lasso 
logistic regression. When 1 = 1 which is identity 
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matrix, the Logit-Lapnet criteria becomes the elastic 
net logistic regression. 

The model estimation can be formulated into a con- 
vex optimization problem: 



/J = argminL(A, a, p 1 ). 

P 



(4) 



To solve problem (4) we used CVX, a package for spe- 
cifying and solving convex programs [29,30]. The con- 
vexity property of Logit-Lapnet guarantees an optimal 
solution using any convex optimization solver. 

Theoretical properties of logit-lapnet 

To study the behavior of the proposed method, we ana- 
lyzed the theoretical properties of Logit-Lapnet. 
Lemma 1 If*i = x j, then j} { = ft for any a < 1 and X > 

0. 

The proof of this lemma is given in the additional file 
1. This lemma states that if two predictors are identical, 
the coefficients of these two predictors will be the same. 
In the Li-norm penalized logistic regression, only one 
predictor is selected. 

Theorem 1 Given data (y, X) and parameters (X, a), 
let p(X, a) be the Logit-Lapnet estimator for problem (4). 
If PiPj > 0 and e t j = 1, define D^{t,j) = ^lA-fAo) -p~j{Kcc)[ 
then 



2X{1 - a) 



(5) 



where P is the sample correlation ofx t and x,- . 

The upper bound in (5) provides a quantitative 
description for the grouping effect of Logit-Lapnet on 
the network structure. For two highly correlated genes 
(p = 1) that are connected in a biological network, the 
difference on the estimated coefficient is almost zero. As 
a goes to 1, the Lapnet becomes lasso logistic regression 
and the difference becomes unbounded. 

Lemma 2 The maximum value of X in problem (4) 
with p ^ Qsatisfies 



1 " 
— (2|X T y| 0O + | ^Xfloo) 



(6) 



The maximum value of X is reciprocal to a. It is clear 
that when a = 1, the Logit-Lapnet criteria becomes Lasso 
and a small penalization can produce an empty model. 
When a = 0, the penalization becomes much larger to 
generate an empty model. In practice, Lemma 2 provides 
a search guidance on the regularization path. The proof 
of Lemma 2 is also given in the additional file 1. 

Gene expression profiles and differential gene analysis 

Gene expression data from breast cancer specimens pro- 
filed by TCGA consortium was used to test the proposed 



Logit-Lapnet method. Level III RNA-Seq data (Illumina 
HigSeq RNASeqV2 from UNC) from patients with inva- 
sive breast carcinoma was obtained from the TCGA data 
portal https://tcga-data.nci.nih.gov/tcga/ in September, 
2012. These data profiled 20501 genes in 806 distinct 
breast cancer specimens. Normalized read counts were 
used for all analyses and were log-transformed prior to 
their use. Genes with normalized read counts less than 
10 in more than 90% of patients were excluded from 
further analyses. The normalized read counts were log- 
transformed and standardized prior to applying the 
method. 

Of the 806 breast cancers characterized by the TCGA, 
261 patients had incomplete information on the three 
markers used to define TNBC and were excluded. Of the 
remaining cancers, 85 TNBC and 460 non-TNBC were 
further separated into training (n = 327; 51 TNBC, 276 
non-TNBC) and test (n = 218; 34 TNBC, 184 non- 
TNBC) sets. A total of 4871 differentially expressed gene 
products (P < 0.05, t-test; > 1.5 fold-change between 
TNBC vs. non-TNBC on training) were merged with pro- 
tein-protein interaction (PPI) networks. This provided us 
with 797 genes and PPI networks of 937 interactions 
available for analysis using our proposed algorithm. 

Protein-protein interaction (PPI) network 

To construct the graph Laplacian matrix for testing the 
Logit-Lapnet method on breast cancer data, networks of 
protein-protein interactions (PPI) identified by two- 
hybrid screening in Homo sapiens were obtained from 
The Biological General Repository for Interaction Data- 
sets (BioGRID, version 3.2.98) http://www.thebiogrid. 
org. At the time it was accessed, BioGRID documented 
21483 interactions between 7700 genes. 

Results and discussion 

Simulation studies 

We initially used a benchmark simulation proposed by 
Li to explore the performance of the proposed Logit- 
Lapnet algorithm [22]. In brief, a network with 200 dis- 
tinct transcription factors (TFs) was simulated. Each TF 
in this simulation regulated 10 genes with a total of 
2,200 genes in the simulated network. The clinical vari- 
able y was assigned a binary value and was associated 
with the first four TFs and their target genes. In model 
I, we assumed that two of the TFs and their targets 
were positively associated with the clinical variable and 
the other two TFs and their targets were negatively 
associated with the clinical variable. 



S = (5, 



' ~M VTo' 's/io yio' ' -Jw Vio 



o o) r 



The clinical variable was defined as y = [Prob (y = 1| 
X; P) > Q] where p ~ £7(0,1). Expression levels for the 
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200 TFs were then simulated using a standard normal 
distribution. Each TF and its target genes were jointly 
distributed as a bivariate normal with correlation of 0.7. 

In model II, gene expression levels were simulated 
similarly to model I except that a TF could be both a 
transcriptional activator and repressor at the same time. 
The coefcient vector was defined as: 

» - (5 ZL ZL ZL -JL -JL _ 5 -JL _JL _JL ZL -5 -3 -3 -3 

' yio' yro' yio' yio vm ' ' yio' vro' yio' yio yip ' ' yio' jm' yip' 

7 7 

3 J J 3 3_ -3_ -3_ - 

yro yio ' ' vio' Vio' vro' yro vm 

7 7 

Model III was similar to model I except that we 
decreased the association of the target genes on the clin- 
ical variable and made the model even sparser. 

t> - (5, ± 1,-5,-— -—,3, — -1,-3,-— -1,0 Of 

v 10 10 10 10 10 10 10 10 ' 



Model IV was similar to model II in allowing tran- 
scription factors to function as both activators and 
repressors. However, the clinical association of the tar- 
get genes was decreased. 

B = fS — 1 — 1 — 1 - -1 -5 -1 -1 -1 --! -L3-L -± -± 
1 ' 10' 10' 10' 10' "' 10 ' ' 10' 10' 10' 10'"" 10 ' ' 10' 10' 10' 

7 7 

3 3 3333 3 lT 

— , —,-3,—,—,—, , ,0, oy 

10 10 10 10 10 10 10 ' 

7 7 

For each model, we simulated both a training data set 
as well as an independent test data set of 100 samples. 
A 10-fold cross-validation procedure was applied to the 
training data set to identify the optimal tuning para- 
meter. Genes with non-zero coefficient in the estimated 
model were found to be associated with the clinical vari- 
able. The sensitivity and specificity of our variable selec- 
tion performance was defined as the following: 

True Negative (TN) := \f!.* fi\o, False Positive (FP) : = \fj. * fi\ 0 

False Negative (FN) := \p. * $\ 0 , True Positive (TP) : = \fi.*$\ 0 (7) 

TP „ TN 

Sensitivity : = TF -^, Specificity : = ^-^ 



where the |-| 0 counts the number of non-zero elements 
in a vector, ft is the logical not operator on a vector and.* 
is the element-wise product. 

We repeated the experiment 50 times. Results are sum- 
marized for each model in Table 1. We also computed 
the Bernoulli error loss on the test data set. For all four 
models, we compared the performance of the proposed 
algorithm to both the Z-i penalized logistic regression and 
the elastic net algorithm. Our method resulted in much 
higher sensitivity in identifying associated genes with the 
same specificity compared to the other two algorithms 
(Table 1). Our method also gave much smaller MSE 
compared to the Lasso and elastic net logistic regression. 
We computed the Receiver Operator Curve on the whole 
regularization path for each of the algorithm. In all four 
models, the proposed algorithm demonstrated much 
higher precision compared to Lasso and elastic net logis- 
tic regression (Figure 1). 

Biomarker identification using logit-lapnet in human 
breast cancers 

Breast cancers are clinically categorized according to the 
expression of several gene products, including the estro- 
gen receptor (ESR1), progesterone receptor (PGR) and 
human epidermal growth factor receptor 2 (ERBB2). 
These biomarkers are routinely used not only to define 
prognosis but also determine treatment. [31,32]. Of the 
breast cancer subtypes defined by these biomarkers, tri- 
ple negative breast cancers (TNBC) lacking expression 
of ESR1, PAR and ERBB2(her2) are the most clinically 
aggressive. TNBC demonstrate high rates of disease pro- 
gression and recurrence [33]. Outcomes for this breast 
cancer subtype are generally poor, largely due to the 
fact that treatment options for women with TNBC are 
limited. However, a subgroup of TNBC are highly sensi- 
tive to conventional chemotherapy [34]. 

We chose to next validate use of our method by test- 
ing its ability to identify the key biomarkers used to 
define breast cancer subtypes. To accomplish this goal, 
we used patterns of gene expression in a large set of 



Table 1 Results of simulations. 







Sensitivity 






Specificity 






Bernoulli Error Loss 




# 


Lasso 


Elastic 


Lapnet 


Lasso 


Elastic 


Lapnet 


Lasso 


Elastic 


Lapnet 


1 


0.3146 


0.4995 


0.7583 


0.9744 


0.9945 


0.9832 


19.24 


17.2 


13.84 




(0.059) 


(0.069) 


(0.071) 


(0.0028) 


(0.002) 


(0.005) 


(0.578) 


(0.537) 


(0.493) 


2 


0.1852 


0.4386 


0.6577 


0.9982 


0.9936 


0.9847 


19.22 


19.62 


16.8 




(0.042) 


(0.076) 


(0.079) 


(0.001) 


(0.003) 


(0.005) 


(0.702) 


(0.671) 


(0.571) 


3 


0.2614 


0.5714 


0.8832 


0.9887 


0.9893 


0.9537 


15.14 


16.44 


14.38 




(0.045) 


(0.068) 


(0.066) 


(0.0024) 


(0.0026) 


(0.006) 


(0.582) 


(0.585) 


(0.495) 


4 


0.2314 


0.6755 


0.8645 


0.9927 


0.9583 


0.9492 


17.38 


18.86 


16.1 




(0.043) 


(0.0712) 


(0.069) 


(0.0019) 


(0.0045) 


(0.009) 


(0.511) 


(0.572) 


(0.570) 



The results for simulations. Sensitivity, specificity and PMSEs are based on 50 simulations. The standard errors are given in parentheses. 
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Model 1 




0.05 0.10 0.15 

False positive rate 

Model 3 



0.20 




0.05 0.10 0.15 

False positive rate 



0.20 






Lasso 




ENet 




LapNet 



0.05 0.10 0.15 

False positive rate 

Model 4 



0.20 




0.05 0.10 0.15 

False positive rate 



0.20 



Figure 1 Receiver operator curves (ROCs) on the regularization path. ROCs were computed on the regularization path for Logit-Lapnet, 
enet, and lasso for all four models (A)-(D). Logit-Lapnet has higher true positive rate and lower false positive rate compared to the other two 
approaches. 



breast cancer specimens recently profiled by the Cancer 
Genome Atlas consortium. The tuning parameter for 
each method was selected through 10-fold cross-valida- 
tion using the training data. We observed the same clas- 
sification accuracy in predicting TNBC tumors (95%) 
when lasso, elastic net and Logit-Lapnet were applied to 
the testing data. The similar efficacy of these 3 algo- 
rithms is not surprising since TNBC tumors are drama- 
tically different from the non-TNBCs in terms of their 
gene expression profiles. 

However, there were a number of significant differ- 
ences observed between the results obtained with the 3 
algorithms tested. For example, use of Logit-Lapnet 
selected 262 genes, of which > 63% (166 genes) 



interacted with one another (Figure 2A). In comparison, 
use of the Lasso algorithm selected only 24 genes, most 
of which (20 genes) were isolated and were not pre- 
dicted to interact. Elastic net selected nearly half of the 
input genes (393 genes) of which 59% (230 genes) were 
interconnected (Additional File 2). Furthermore, neither 
Lasso nor elastic net identified the progesterone recep- 
tor as a key discriminator for TNBC, despite the fact 
that this gene product is routinely used to clinically 
categorize breast cancers. Only Logit-Lapnet successfully 
identified each of the three markers used to define 
TNBC subtype: ESR1, PGR and ERBB2. These results 
suggest that Logit-Lapnet is more accurate than either 
Lasso or elastic net for identifying biomarkers from 
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(A) 

Selected Genes Selected Interconnected Genes 

Elastic Net 



Logit-Lapnet 19g 

39 170 



Logit-Lapnet Elastic Net 

56 120 



Lasso 



(B) Lasso 

II 



(C) Logit-Lapnet 

Hormone receptor 
signaling 

Estrogen / M * 

receptor (ER) ^^^ \ I 

NCOA2 



1 



Expression in TN8C 
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Subnet 2 
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HMGA1 \ ^ 
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TP53BP2 

MAPK8IP2 FSCN1 



TLE1 

KB1B 



T H 



HER2 gene 
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It 



i _ » « » 



■Ft. 



Figure 2 Application of the algorithm to identify TNBC-associated genes. Genes and subnetworks of PPI associated with TNBC using TCGA 
breast cancer data and BioGRID PPI. Comparison of the selected genes from our proposed algorithm with those from lasso and elastic net (A). 
Genes and their respective subnetworks of PPI recovered by lasso (B) and our proposed Laplacian net algorithm (C). In the networks, genes 
having association to breast cancer reported in the literature are labeled with larger font. 



large multidimensional genomic datasets such as those 
generated by the TCGA. 

Use of logit-lapnet for inferring novel biologic 
relationships 

In addition to identifying each of the 3 key biomarkers 
used to categorize breast cancer subtypes, we found that 
our method identified multiple subnetworks of gene 
expression in the breast cancer specimens profiled by 
TCGA. These subnetworks potentially reflect relation- 
ships with clinical or biologic significance. For example, 
one of subnetworks we identified includes multiple 
genes (AR, ESR1, MED1, MED24, RARA, PRAME, and 
HMGA1) involved in steroid hormone signaling. As 
demonstrated in Figure 2, Logit-Lapnet found that this 
cluster is closely connected to a known breast cancer 
gene, GATA3. This connection suggests a functional 
relationship. Evidence to support this relationship could 
not be found by directly searching the database of pro- 
tein-protein interactions used to construct our algo- 
rithm. However, at least one report published since our 
initial analyses has now shown that GATA3 mediates 



genomic ESR1 -binding upstream of FOXA1 [35]. This 
confirms that the integration of genomic and PPI data 
by our method has the capacity to identify new and 
otherwise unanticipated relationships with biologic 
significance. 

Enhanced network specificity provided by logit-lapnet 

Another advantageous feature of Logit-Lapnet is the 
potential functional specificity of the subnetworks deli- 
neated by our algorithm. Subnetworks identified by 
Logit-Lapnet allow investigators to more readily focus 
on key genes for subsequent downstream functional 
analyses. For example, the subnetwork connecting AR, 
ESR1 and RARA identified by our algorithm in TNBC 
represents a cluster of genes involved in hormone recep- 
tor signaling (Figure 2). The subnetwork 2 identified by 
our method contains the genes PLK1, BCL2, BNIPL, 
and tumor suppressor p53 binding protein TP53BP2. 
PLK1, BNIPL and BCL2 are well-known oncogenes and 
part of TP53 pathway [36]. The expression of BCL2 has 
been proposed as a prognostic marker for breast cancer 
patients. This subnetwork is also interesting, as it 



Zhang et al. BMC Genomics 2013, 14(Suppl 8):S7 
http://www.biomedcentral.eom/1 471 -2 1 64/1 4/S8/S7 



Page 7 of 8 



predicts a relationships between gene products that have 
been previously shown to impact the Gl-S (TP53BP) 
and G2-M cell cycle checkpoints (PLK1). Dysregulation 
of both Gl-S and G2-M are key hallmarks of human 
cancer as defined by Weinberg and others. Furthermore, 
TP53 dysfunction has been previously shown to lead to 
the overexpression of PLK1 and other gene products 
important for driving cells with genomic instability 
through the cell cycle. Thus, the ability of Logit-Lapnet 
to detect this relationship underscores its capacity to 
detect key events in breast and other human cancers. 
These results convey a clear message that interactions 
between genes in clusters might be biologically relevant. 

In contrast, alternative algorithms such as elastic net 
identify fewer subnetworks containing larger number of 
genes. This means that subnetworks identified by Logit- 
Lapnet allow investigators to more readily focus on key 
targets for subsequent downstream functional validation. 
Although network modularization methods can be used 
to further dissemble networks defined by elastic net into 
smaller modules, these smaller modules may not reflect 
direct interactions between their individual components. 
This is because elastic net defines groups of co- 
expressed genes rather than networks of functionally 
interacting gene products. Alternatively, use of Logit- 
Lapnet can be considered to place gene correlations in 
the context of biologic function. Thus, its use can be 
reasonably anticipated to define relationships that are 
more likely to be biologically and clinically relevant. 

In summary, the proposed algorithm identified set of 
genes associated with breast tumors. In addition, inte- 
gration of PPI in the algorithm enables us to recover the 
genes not only in association to the breast cancer sub- 
types but also their interacting partners which are also 
breast cancer related. Most importantly, from the com- 
parison with elastic net and lasso, our method selected a 
reasonable size of genes and is the only algorithm cap- 
able of identified all three marker genes in defining 
TNBC. 

Conclusion 

We have developed a graph Laplacian regularized logis- 
tic regression model for selecting the genes or network 
modules associated with clinical variables from gene 
expression profiles. Through simulation studies, we have 
demonstrated that our approach is much more sensitive 
for identifying clinically relevant genes and network 
modules. We have presented a case study on network 
module identification for triple negative breast cancer 
sub-type using mRNA-seq data. Our results indicate 
that Logit-Lapnet is a superior algorithm compared to 
Lasso and elastic net in disease gene and network mod- 
ule selection. Further work to biologically validate our 
predicted modules is needed to gain a more complete 



picture of the regulatory process in the TNBC sub-type. 
Beyond genomics, there are many potential applications 
of Logit-Lapnet to utilize the rich information provided 
by network structure, such as metabolomics and proe- 
teomics studies. In conclusion, our work developing the 
network structure regularized logistic regression model 
has many implications and has provided a new tool for 
research in genomic studies. 

Additional material 



Additional file 1: Proofs of lemma and theorem This file includes the 
mathematics proofs on lemma 1, 2 and theorem 1. 

Additional file 2: Genes identified by elastic net. This file includes a 
figure of genes and their respective subnetworks of PPI identified by 
elastic net. Genes with larger font indicates its association to breast 
cancer reported in the literature. 
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